SiFT: uncovering hidden biological processes by probabilistic filtering of single-cell data

Cellular populations simultaneously encode multiple biological attributes, including spatial configuration, temporal trajectories, and cell-cell interactions. Some of these signals may be overshadowed by others and harder to recover, despite the great progress made to computationally reconstruct biological processes from single-cell data. To address this, we present SiFT, a kernel-based projection method for filtering biological signals in single-cell data, thus uncovering underlying biological processes. SiFT applies to a wide range of tasks, from the removal of unwanted variation in the data to revealing hidden biological structures. We demonstrate how SiFT enhances the liver circadian signal by filtering spatial zonation, recovers regenerative cell subpopulations in spatially-resolved liver data, and exposes COVID-19 disease-related cells, pathways, and dynamics by filtering healthy reference signals. SiFT performs the correction at the gene expression level, can scale to large datasets, and compares favorably to state-of-the-art methods.

Editorial Note: This manuscript has been previously reviewed at another journal that is not operating a transparent peer review scheme. This document only contains reviewer comments and rebuttal letters for versions considered at Nature Communications.
Reviewer #5 (Remarks to the Author): I believe the question raised by Reviewer 2 is satisfactorily addressed. Most questions raised by Reviewer 4 are also satisfactorily addressed, namely major comment 1, and minor comments 1, 2, and 5. Specifically, regarding major comment 1, I believe that the authors illustrated in this version an innovative way to analyze scRNA-seq data. The results are interesting and likely appealing to readers. However, minor comments 3 and 4 could be addressed more thoroughly.
Minor comment 4 mentioned, and I concur, that "quadratic time complexity" is a serious problem. Thus, the author should explicitly mention the "quadratic time complexity", and perhaps possible algorithmic tricks to mitigate it. Only mentioning that a given number of cells can be handled by a GPU is not sufficient. Toning down the claim, as the original comment recommended, in the abstract is also a good idea. I was not expecting quadratic complexity and the GPU requirement when I saw "naturally scale". In addition, the space complexity for K is also quadratic. For the Heart Cell Atlas data with 500,000 cells, this means 250G entries. I assume this is addressed by the "row-wise" computation approach mentioned in the response letter, but implementation details like this that are critical to scalability should be clearly explained in the article to support the claim.
For minor comment 3, I am not sure the scib metrics are comprehensive enough on judging the method. Specifically, corrected data are rarely used for statistical tests because the distorted distribution can lead to unreliable results. Instead, it is recommended to run tests that take covariates into consideration on raw data (Luecken 2019 Molecular Systems Biology). While this study shows that breaking this rule can lead to interesting findings, the readers should be advised that DE analyses are not systematically examined for it and the p-values need to be interpreted with caution. In fact, I would appreciate a more systematic examination/calibration of commonly used DE analysis methods on "SiFTed" data, but I can understand if the authors consider it to be out of the scope of this article.

Response to reviewer comments for manuscript NCOMMS-23-50986-T
We thank the reviewer for their constructive feedback on our manuscript. We have addressed all remaining comments and revised the manuscript accordingly. We provide a point-by-point response below.

Point-by-point response to Reviewer #5 comments
Remarks to the Author: I believe the question raised by Reviewer 2 is satisfactorily addressed.
Most questions raised by Reviewer 4 are also satisfactorily addressed, namely major comment 1, and minor comments 1, 2, and 5. Specifically, regarding major comment 1, I believe that the authors illustrated in this version an innovative way to analyze scRNA-seq data. The results are interesting and likely appealing to readers. However, minor comments 3 and 4 could be addressed more thoroughly.

Comments
1. Minor comment 4 mentioned, and I concur, that "quadratic time complexity" is a serious problem. Thus, the author should explicitly mention the "quadratic time complexity", and perhaps possible algorithmic tricks to mitigate it. Only mentioning that a given number of cells can be handled by a GPU is not sufficient. Toning down the claim, as the original comment recommended, in the abstract is also a good idea. I was not expecting quadratic complexity and the GPU requirement when I saw "naturally scale". In addition, the space complexity for K is also quadratic. For the Heart Cell Atlas data with 500,000 cells, this means 250G entries. I assume this is addressed by the "row-wise" computation approach mentioned in the response letter, but implementation details like this that are critical to scalability should be clearly explained in the article to support the claim.
A: We thank the reviewer for this comment. To address this point, we have toned down the "naturally scale" statement in the abstract and verified that every reference to scalability in the main text mentions the dependence on GPU support. In addition, we have revised the Methods section "runtime considerations" to include implementation details concerning the framework's scalability to support our claims. Specifically, we address the following:
a. Row-wise batching: as GPU memory is limited, we implemented a row-wise batching mechanism in SiFT, similar to the notion of batches in training neural networks. Importantly, this is built into the implementation and does not require any user engagement. Algorithmically this is sensible because, by definition, the normalization in SiFT is only required over rows of the kernel.
b. Sparse matrix support: in addition, whenever possible we perform computations over sparse matrices. This may be the case, for example, if the input "count matrix" is sparse and a knn kernel is computed over it. Again, this is the default behavior of SiFT and does not require any additional input from the user.
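To illustrate the row-wise batching idea in (a), the following is a minimal NumPy sketch, not SiFT's actual implementation. It assumes an RBF kernel over a cell embedding `Z`, and it assumes that the filtered expression is the input `X` minus its projection onto the row-normalized kernel; the function names, `batch_size`, and `length_scale` are hypothetical. Because each kernel row is normalized independently, a block of rows can be built, normalized, applied, and discarded without ever holding the full n-by-n kernel.

```python
import numpy as np


def rbf_kernel_rows(Z_batch, Z, length_scale=1.0):
    """RBF similarities between a block of cells and all cells (assumed kernel)."""
    # Pairwise squared Euclidean distances, shape (batch, n).
    sq = (
        np.sum(Z_batch**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * Z_batch @ Z.T
    )
    np.maximum(sq, 0.0, out=sq)  # clip tiny negatives caused by round-off
    return np.exp(-sq / (2.0 * length_scale**2))


def filter_rowwise(X, Z, batch_size=1024, length_scale=1.0):
    """Subtract the kernel projection from X, one block of kernel rows at a time."""
    n = X.shape[0]
    out = np.empty_like(X, dtype=np.float64)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Only a (batch, n) slice of the kernel exists in memory at any time.
        K = rbf_kernel_rows(Z[start:stop], Z, length_scale)
        # Row normalization needs only the entries of each row itself.
        K /= K.sum(axis=1, keepdims=True)
        out[start:stop] = X[start:stop] - K @ X
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((200, 5))  # toy expression matrix (cells x genes)
    Z = rng.random((200, 2))  # toy embedding used to build the kernel
    batched = filter_rowwise(X, Z, batch_size=32)
    full = filter_rowwise(X, Z, batch_size=200)
    print(np.allclose(batched, full))
```

Batching bounds peak memory at a batch-by-n slice, but the total work remains quadratic in the number of cells. For the 500,000 cells mentioned by the reviewer, the full kernel has 2.5 × 10^11 entries (about 1 TB in 32-bit floats), whereas a block of 1,024 rows has about 5.1 × 10^8 entries (about 2 GB).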
2. For minor comment 3, I am not sure the scib metrics are comprehensive enough on judging the method. Specifically, corrected data are rarely used for statistical tests because the distorted distribution can lead to unreliable results. Instead, it is recommended to run tests that take covariates into consideration on raw data (Luecken 2019 Molecular Systems Biology). While this study shows that breaking this rule can lead to interesting findings, the readers should be advised that DE analyses are not systematically examined for it and the p-values need to be interpreted with caution. In fact, I would appreciate a more systematic examination/calibration of commonly used DE analysis methods on "SiFTed" data, but I can understand if the authors consider it to be out of the scope of this article.
A: We agree with the disclaimer the reviewer raises regarding DE tests. However, there is still no general consensus on the optimal DE analysis for single-cell data, and the robustness of DE tools across datasets is low [1,2,3]. In addition, given the increasing complexity of single-cell datasets and ongoing attempts to integrate large-scale atlases across samples, dedicated papers have addressed a more specific question: DE analysis over integrated data, which is related to our setting because it requires addressing data correction. For example, Nguyen et al. [4] compared various workflows for DE analysis of scRNA-seq data with multiple batches in diverse settings, including after batch effect correction. However, the complexity of the setting limits the ability to derive a "rule of thumb" applicable to all cases. Hence, in SiFT we took an exploratory approach which, as stated by the reviewer, showed that DE analysis over corrected data can lead to interesting biological findings. Thus, while a deeper examination of this question is intriguing, we find it, as the reviewer notes, beyond the scope of this article and an interesting direction for future work, as we now note in the revised discussion.